This paper proposes a novel sequence-to-sequence (seq2seq) model with a musical-note-position-aware attention mechanism for singing voice synthesis (SVS). A seq2seq modeling approach that performs acoustic and temporal modeling simultaneously is attractive. However, because temporal modeling of singing voices is difficult, many recent encoder-decoder-based SVS systems still rely explicitly on duration information generated by additional modules. Although some studies perform simultaneous modeling with attention-based seq2seq models, their temporal modeling lacks robustness. The proposed attention mechanism estimates the attention weights by taking into account the rhythm given by the musical score. Several techniques are also introduced to improve the modeling performance for singing voices. Experimental results indicate that the proposed model is effective in terms of both the naturalness and the timing robustness of the synthesized singing voice.
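The abstract does not detail the mechanism, but a minimal sketch of the general idea, biasing attention energies with a score-derived note-position prior before normalization, is shown below. This is an illustrative PyTorch example, not the authors' implementation; the class name, the Gaussian-window prior, and all dimensions are assumptions.

```python
# Hypothetical sketch: content-based attention whose energies are biased by a
# log-domain prior derived from the musical note positions in the score.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NotePositionAwareAttention(nn.Module):
    def __init__(self, query_dim: int, key_dim: int, attn_dim: int = 128):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, attn_dim)
        self.key_proj = nn.Linear(key_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, query, keys, note_position_prior):
        """
        query:               (B, query_dim)      current decoder state
        keys:                (B, T_enc, key_dim) encoder outputs
        note_position_prior: (B, T_enc)          log-domain prior indicating how
                                                 close each encoder step is to the
                                                 note expected at this output frame
        """
        # Content-based (additive) attention energies.
        q = self.query_proj(query).unsqueeze(1)               # (B, 1, attn_dim)
        k = self.key_proj(keys)                                # (B, T_enc, attn_dim)
        energies = self.score(torch.tanh(q + k)).squeeze(-1)   # (B, T_enc)

        # Bias the energies with the score-derived prior before normalizing,
        # so each frame attends near the note it should be singing.
        weights = F.softmax(energies + note_position_prior, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)  # (B, key_dim)
        return context, weights


if __name__ == "__main__":
    B, T_enc = 2, 50
    attn = NotePositionAwareAttention(query_dim=256, key_dim=256)
    query = torch.randn(B, 256)
    keys = torch.randn(B, T_enc, 256)
    # Example prior: a Gaussian window centered on the expected note position.
    centers = torch.tensor([10.0, 30.0]).unsqueeze(1)           # (B, 1)
    positions = torch.arange(T_enc).float().unsqueeze(0)        # (1, T_enc)
    prior = -((positions - centers) ** 2) / (2 * 4.0 ** 2)      # log domain
    context, weights = attn(query, keys, prior)
    print(context.shape, weights.shape)
```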
The quality of recent text-to-speech (TTS) systems is comparable to that of human speech. However, their application to spoken dialogue has not been widely studied. This study aims at realizing TTS that closely resembles human dialogue. First, we recorded and transcribed actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages: in the first stage, a variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained, which are extensions of VITS, a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking-style representation from speech is trained jointly with the TTS model. In the second stage, a style predictor is trained to predict the speaking style to be synthesized from the dialogue history. During inference, by feeding the style representation predicted by the style predictor to the VAE/GMVAE-VITS, speech can be synthesized in a style appropriate to the context of the dialogue. Subjective evaluation results show that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.
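A minimal sketch of the two-stage idea follows: a style encoder trained with the TTS model in stage 1, and a style predictor trained on dialogue-history embeddings in stage 2 to regress the encoder's style vectors. The module names, layer choices, and dimensions are assumptions for illustration; VITS itself is not implemented here.

```python
# Hypothetical two-stage setup: stage-1 style encoder (trained jointly with TTS)
# and stage-2 style predictor over dialogue history. Names are illustrative.
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    """Stage 1: maps reference speech features to a latent speaking-style vector."""
    def __init__(self, feat_dim=80, style_dim=16):
        super().__init__()
        self.net = nn.GRU(feat_dim, 128, batch_first=True)
        self.to_style = nn.Linear(128, style_dim)

    def forward(self, mel):                      # mel: (B, T, feat_dim)
        _, h = self.net(mel)
        return self.to_style(h[-1])              # (B, style_dim)


class StylePredictor(nn.Module):
    """Stage 2: predicts the style of the next utterance from dialogue-history embeddings."""
    def __init__(self, hist_dim=256, style_dim=16):
        super().__init__()
        self.net = nn.GRU(hist_dim, 128, batch_first=True)
        self.to_style = nn.Linear(128, style_dim)

    def forward(self, history):                  # history: (B, N_turns, hist_dim)
        _, h = self.net(history)
        return self.to_style(h[-1])              # (B, style_dim)


# Stage-2 training target: styles extracted by the (frozen) stage-1 style encoder.
encoder, predictor = StyleEncoder(), StylePredictor()
mel = torch.randn(4, 200, 80)                    # reference speech of the target turn
history = torch.randn(4, 5, 256)                 # embeddings of 5 preceding turns
with torch.no_grad():
    target_style = encoder(mel)
loss = nn.functional.mse_loss(predictor(history), target_style)
loss.backward()
# At inference, predictor(history) replaces the reference-derived style and would
# be fed to the VAE/GMVAE-VITS decoder to synthesize context-appropriate speech.
```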
This paper proposes a hierarchical generative model with multi-grained latent variables for synthesizing expressive speech. In recent years, fine-grained latent variables have been introduced into text-to-speech synthesis, enabling fine control of the prosody and speaking style of synthesized speech. However, the naturalness of the speech degrades when these latent variables are obtained by sampling from the standard Gaussian prior. To solve this problem, we propose a novel framework for modeling the fine-grained latent variables that takes into account the dependence on the input text, the hierarchical linguistic structure, and the temporal structure of the latent variables. The framework consists of a multi-grained variational autoencoder, a conditional prior, and a multi-level autoregressive latent converter, which obtains latent variables at different temporal resolutions and samples finer-level latent variables from coarser-level ones while taking the input text into account. Experimental results indicate an appropriate method for sampling fine-grained latent variables without a reference signal at the synthesis stage. The proposed framework also provides controllability of the speaking style over the entire utterance.
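A minimal sketch of the coarse-to-fine sampling step follows: an autoregressive latent converter that, conditioned on a coarser-level latent and per-unit text features, predicts a prior over the finer-level latents and samples them one unit at a time. The class name, the single-layer GRU cell, and all dimensions are assumptions, not the authors' code.

```python
# Hypothetical sketch: autoregressive latent converter that samples finer-grained
# latents from a coarser-grained latent, conditioned on text features.
import torch
import torch.nn as nn


class LatentConverter(nn.Module):
    """Predicts a Gaussian prior over fine-level latents (one per fine unit),
    conditioned on the coarser-level latent and text features, autoregressively."""
    def __init__(self, text_dim=128, latent_dim=8, hidden=128):
        super().__init__()
        self.rnn = nn.GRUCell(text_dim + 2 * latent_dim, hidden)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)

    def forward(self, coarse_z, text_feats):
        # coarse_z:   (B, latent_dim)        e.g. utterance-level latent
        # text_feats: (B, N_fine, text_dim)  text features per fine-level unit
        B, N, _ = text_feats.shape
        h = text_feats.new_zeros(B, self.rnn.hidden_size)
        prev_z = coarse_z.new_zeros(B, coarse_z.size(-1))
        fine_zs = []
        for n in range(N):
            inp = torch.cat([text_feats[:, n], coarse_z, prev_z], dim=-1)
            h = self.rnn(inp, h)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            # Reparameterized sample from the predicted prior for this unit.
            prev_z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            fine_zs.append(prev_z)
        return torch.stack(fine_zs, dim=1)        # (B, N_fine, latent_dim)


# Toy usage: sample phrase-level latents from an utterance-level latent and text.
converter = LatentConverter()
utterance_z = torch.randn(2, 8)
text_feats = torch.randn(2, 12, 128)
phrase_zs = converter(utterance_z, text_feats)
print(phrase_zs.shape)  # torch.Size([2, 12, 8])
```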